Support Arrow IPC Stream Files #18457

corasaurus-hex · 2025-11-03T02:32:56Z

Which issue does this PR close?

Closes #16688 (Add support for registering files in the Arrow IPC stream format as tables using register_arrow or similar).

Rationale for this change

Currently DataFusion can only read Arrow files if they're in the File format, not the Stream format. I work with a bunch of Stream format files and wanted native support.

What changes are included in this PR?

To accomplish the above, this PR splits the Arrow datasource into two separate implementations (ArrowStream* and ArrowFile*) with a facade on top to differentiate between the formats at query planning time.

Are these changes tested?

Yes, there are end-to-end sqllogictests along with tests for the changes within datasource-arrow.

Are there any user-facing changes?

Technically yes, in that we support a new format now. I'm not sure which documentation would need to be updated?

corasaurus-hex · 2025-11-04T07:00:33Z

datafusion/datasource-arrow/src/source.rs

+        // correct offset which is a lot of duplicate I/O. We're opting to avoid
+        // that entirely by only acting on a single partition and reading sequentially.
+        Ok(None)
+    }


This is perhaps the weightiest decision in this PR. If we want to repartition a file in the IPC stream format, then we need to either read from the beginning of the file for each partition, or figure out another way to create the ad-hoc equivalent of the IPC file format footer so we can minimize duplicate reads (likely by reading the entire file all the way through once and then caching the result in memory for the execution plan to use for each partition).

I'd argue that while this problem is worth solving, doing so is tangential to this change.
I'd like to see this solved, but I see no reason why we couldn't solve this in a follow-on.

Probably worth documenting the practical consequences of leaving it in this state though -- correct me if I'm wrong here, but I think this means that we end up hydrating the entire file into memory for certain operations, right? That's probably not a good long-term state.

I can't imagine this would mean I need to read the entire file into memory and keep it there? In my previous message I meant we would need to read all the record batch and dictionary locations and keep them in memory in much the same way that the arrow file format footer does. So it would mean a single pass through to record all of that and then multiple threads can seek to different parts of the file and process it.

That's my understanding of the effect of this, that it means we can't parallelize queries against this file format.

If you believe that the resulting behavior would be pathological to the extreme then we should absolutely document that. Thoughts on how we can reliably test that it is? Or who might be aware of the implications of this? And where to document it?
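To make the "single pass, then seek" idea above concrete, here is a minimal sketch of what an in-memory index might look like and how it could be divided among partitions. It is not code from this PR: `BlockLocation` and `assign_blocks` are hypothetical names, and the fields are assumed to mirror the offset, metadata length, and body length that the Arrow file-format footer records for each block. Dictionary batches, which every partition would still need to see, are left out.

```rust
/// Hypothetical record of where one IPC message lives in the file.
/// Assumed to mirror the per-block entries of the file-format footer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockLocation {
    pub offset: u64,
    pub metadata_len: u32,
    pub body_len: u64,
}

/// Split an index of record-batch locations into at most `partitions`
/// contiguous groups of roughly equal size, preserving file order.
pub fn assign_blocks(
    blocks: &[BlockLocation],
    partitions: usize,
) -> Vec<Vec<BlockLocation>> {
    if blocks.is_empty() || partitions == 0 {
        return Vec::new();
    }
    // Ceiling division so that no more than `partitions` groups are produced.
    let chunk = (blocks.len() + partitions - 1) / partitions;
    blocks.chunks(chunk).map(|group| group.to_vec()).collect()
}
```

Each partition could then seek directly to the first offset in its group instead of re-reading the stream from the start. Building the index still costs one sequential pass over the file.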

martin-g · 2025-11-04T12:46:00Z

datafusion/datasource-arrow/src/file_format.rs

+    );
+
    let meta_len = [meta_len[0], meta_len[1], meta_len[2], meta_len[3]];
    let meta_len = i32::from_le_bytes(meta_len);


I think it should be possible to (manually) manipulate the file's bytes in such a way that it produces a negative i32 here.
Then the cast to usize below will lead to problems.
What is the reason for meta_len being i32 instead of u32?

I'm honestly not sure, this was from the code that was there previously. #7962 is the PR that introduced it initially, and I don't see any information there about why this choice was made. @Jefffrey do you recall why i32 instead of u32? I'm happy to change it but I don't understand the implications.

Hmm, maybe I was referring to the spec: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format

And saw it say <metadata_size: int32> so I defaulted to i32 🤔

Checking for valid i32 (aka non-negative) does sound reasonable for robustness

I pushed up a refactor where I'm checking that it's not negative now, we should be good.
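For readers following the thread, a non-negative check of this kind might look roughly like the sketch below. This is not the code that was pushed: `parse_metadata_len` and `MetaLenError` are hypothetical names, and the sketch assumes any continuation marker has already been consumed before these four bytes are read.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum MetaLenError {
    /// Fewer than four bytes were available.
    Truncated,
    /// The little-endian int32 decoded to a negative value.
    Negative(i32),
}

/// Decode the `<metadata_size: int32>` field and reject negative values
/// instead of casting them to `usize`.
pub fn parse_metadata_len(bytes: &[u8]) -> Result<usize, MetaLenError> {
    let raw: [u8; 4] = bytes
        .get(..4)
        .ok_or(MetaLenError::Truncated)?
        .try_into()
        .map_err(|_| MetaLenError::Truncated)?;
    let len = i32::from_le_bytes(raw);
    // `usize::try_from` fails for negative inputs, so no `as` cast is needed.
    usize::try_from(len).map_err(|_| MetaLenError::Negative(len))
}
```

Keeping the field as i32 matches the spec's int32, while the checked conversion avoids a negative length wrapping into a huge usize.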

martin-g · 2025-11-04T12:47:04Z

datafusion/datasource-arrow/src/source.rs

+        let statistics = &self.projected_statistics;
+        Ok(statistics
+            .clone()
+            .expect("projected_statistics must be set"))


Does it need to panic here? Would it be better to return an Err?

I'm not sure if we need the panic here, I'm mirroring the other Arrow FileSource. I'm happy to change both but I'll need to track down the implications of this tonight. I was trying to minimize changes in behavior since I'm a new contributor.

Best I can tell it was introduced in #14224

This is done in json, csv, avro, parquet, and this file for the arrow file format. I don't understand the implications other than it's used in every instance I can find and so I'm fine leaving it for the moment and fixing it if we have a problem later.
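If this were changed later, the non-panicking variant suggested above could look something like the following sketch. All types here (`Statistics`, `SourceError`, `StreamSource`) are simplified stand-ins invented for illustration, not DataFusion's own types.

```rust
/// Simplified stand-in for a statistics type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct Statistics {
    pub num_rows: Option<usize>,
}

/// Simplified stand-in for an error type (hypothetical).
#[derive(Debug, PartialEq)]
pub enum SourceError {
    Internal(String),
}

/// Simplified stand-in for a file source (hypothetical).
#[derive(Default)]
pub struct StreamSource {
    projected_statistics: Option<Statistics>,
}

impl StreamSource {
    pub fn with_statistics(mut self, statistics: Statistics) -> Self {
        self.projected_statistics = Some(statistics);
        self
    }

    /// Return an error instead of panicking when statistics are missing.
    pub fn statistics(&self) -> Result<Statistics, SourceError> {
        self.projected_statistics.clone().ok_or_else(|| {
            SourceError::Internal("projected_statistics must be set".to_string())
        })
    }
}
```

The trade-off is that a missing value surfaces as a recoverable error at the call site rather than aborting the query with a panic.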

Jefffrey · 2025-11-05T02:08:34Z

datafusion/datasource-arrow/src/file_format.rs

+    );
+
    let meta_len = [meta_len[0], meta_len[1], meta_len[2], meta_len[3]];
    let meta_len = i32::from_le_bytes(meta_len);


Hmm, maybe I was referring to the spec: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format

And saw it say <metadata_size: int32> so I defaulted to i32 🤔

Checking for valid i32 (aka non-negative) does sound reasonable for robustness

Jefffrey · 2025-11-05T02:10:44Z

datafusion/datasource-arrow/src/file_format.rs

 }

+// Custom implementation of inferring schema. Should eventually be moved upstream to arrow-rs.
+// See <https://github.com/apache/arrow-rs/issues/5021>


I haven't fully reviewed this PR, but I'm curious whether you've managed to check if this code has been upstreamed to arrow-rs by now, so that we might be able to leverage its code?

I looked into using the various readers available. FileDecoder requires a schema to create the struct, which defeats the point entirely, and FileReader requires the passed-in object to support Read + Seek (we're dealing with a stream of bytes here that only does Read). I think I could keep the magic bytes handling here and then use a Cursor over the bytes already read and chain it with the remainder of the stream, passing that into a StreamReader to parse the schema. So, still a little bit of parsing, but much less.

Unfortunately, the fact that it's an async stream breaks a lot of the things we could do with upstream functions. We need to know how much to read off the stream to use it synchronously which means we need to do some parsing. I've significantly refactored it and like the result better but I'm going to stick with the parsing how it is, more or less.
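The "Cursor over the bytes already read, chained with the remainder" idea from the earlier reply can be sketched with the standard library alone. This is a synchronous illustration only, so it does not address the async-stream limitation described above; `peek_prefix` is a hypothetical helper name.

```rust
use std::io::{self, Cursor, Read};

/// Read up to `n` bytes from `reader`, then return those bytes together with
/// a new reader that replays them before continuing with the rest of the input.
pub fn peek_prefix<R: Read>(
    mut reader: R,
    n: usize,
) -> io::Result<(Vec<u8>, impl Read)> {
    let mut prefix = Vec::with_capacity(n);
    // `take` stops after `n` bytes, or earlier if the input is shorter.
    reader.by_ref().take(n as u64).read_to_end(&mut prefix)?;
    // Chain the consumed bytes in front of the untouched remainder.
    let replay = Cursor::new(prefix.clone()).chain(reader);
    Ok((prefix, replay))
}
```

The returned reader yields the same byte sequence as the original input, so a downstream parser that only needs Read never notices that the prefix was inspected first.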

jdcasale

I think this is basically right. Couple of nits, one question.

jdcasale · 2025-11-04T15:05:22Z

datafusion/datasource-arrow/src/file_format.rs

        conf: FileScanConfig,
    ) -> Result<Arc<dyn ExecutionPlan>> {
-        let source = Arc::new(ArrowSource::default());
+        let is_stream_format = if let Some(first_group) = conf.file_groups.first() {


Maybe worth pulling this out into a helper method that's easy to test. Also then this method reads a bit cleaner, with just a is_stream_format() check as opposed to this block of logic which is not directly relevant to creating a physical plan.

fixed! I opted to go with a positive check for whether it's in the arrow file format. Future steps that perform the actual parsing of the file should catch if it's not in the arrow stream format either.
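A positive check of that kind could be as small as the sketch below. The helper name `is_arrow_file_format` is hypothetical and the PR's actual function may differ; the only fact relied on is that the Arrow IPC file format begins with the magic bytes ARROW1.

```rust
/// Magic bytes at the start of a file in the Arrow IPC file format.
const ARROW_FILE_MAGIC: &[u8] = b"ARROW1";

/// Positive check: true only when the leading bytes carry the file-format
/// magic. Anything else is left for the stream reader to accept or reject.
pub fn is_arrow_file_format(prefix: &[u8]) -> bool {
    prefix.starts_with(ARROW_FILE_MAGIC)
}
```

Because the function takes a plain byte slice, it can be unit tested without building a FileScanConfig or touching an object store.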

jdcasale · 2025-11-04T15:08:52Z

datafusion/datasource-arrow/src/file_format.rs

            "Unexpected end of byte stream for Arrow IPC file".to_string(),
-        ))?;
+        )
+        .into());


return Err(...)? is redundant; you really only need either a bare Err(...)? or a return Err(...). A bare Err(...)? looks funny to me, though, and we still need to convert the ArrowError into a DataFusionError (which ? does for us automatically), so we end up with return Err(...).into()

err, return Err(err.into()) in this case
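The two equivalent spellings discussed here can be compared side by side in a small sketch. `LowLevelError` and `HighLevelError` are hypothetical stand-ins for ArrowError and DataFusionError; the only assumption is that a From conversion exists between them.

```rust
/// Stand-in for the lower-level error (hypothetical).
#[derive(Debug, PartialEq)]
pub struct LowLevelError(pub String);

/// Stand-in for the higher-level error (hypothetical).
#[derive(Debug, PartialEq)]
pub enum HighLevelError {
    Wrapped(LowLevelError),
}

impl From<LowLevelError> for HighLevelError {
    fn from(err: LowLevelError) -> Self {
        HighLevelError::Wrapped(err)
    }
}

fn checked_len(bytes: &[u8]) -> Result<usize, LowLevelError> {
    if bytes.is_empty() {
        return Err(LowLevelError("Unexpected end of byte stream".to_string()));
    }
    Ok(bytes.len())
}

/// `?` converts the error through `From` automatically.
pub fn via_question_mark(bytes: &[u8]) -> Result<usize, HighLevelError> {
    let len = checked_len(bytes)?;
    Ok(len)
}

/// An explicit return needs the conversion spelled out with `.into()`,
/// placed inside the `Err(...)` so the error value is what gets converted.
pub fn via_explicit_return(bytes: &[u8]) -> Result<usize, HighLevelError> {
    if bytes.is_empty() {
        let err = LowLevelError("Unexpected end of byte stream".to_string());
        return Err(err.into());
    }
    Ok(bytes.len())
}
```

Both functions produce the same error value for empty input; the difference is purely stylistic.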

jdcasale · 2025-11-05T15:26:10Z

datafusion/datasource-arrow/src/file_format.rs

+    let (meta_len, rest_of_bytes_start_index): ([u8; 4], usize) = (
+        bytes[preamble_size..preamble_size + 4]
+            .try_into()
+            .map_err(|err| {
+                ArrowError::ParseError(format!(
+                    "Unable to read IPC message as metadata length: {err:?}"
+                ))
+            })?,
+        preamble_size + 4,
+    );


Am I reading this right that rest_of_bytes_start_index is always just preamble_size + 4?

If that's the case, it may be clearer to do two separate assignments, i.e.

Suggested change

-    let (meta_len, rest_of_bytes_start_index): ([u8; 4], usize) = (
-        bytes[preamble_size..preamble_size + 4]
-            .try_into()
-            .map_err(|err| {
-                ArrowError::ParseError(format!(
-                    "Unable to read IPC message as metadata length: {err:?}"
-                ))
-            })?,
-        preamble_size + 4,
-    );
+    let rest_of_bytes_start_index: usize = preamble_size + 4;
+    let meta_len: [u8; 4] = bytes[preamble_size..rest_of_bytes_start_index]
+        .try_into()
+        .map_err(|err| {
+            ArrowError::ParseError(format!(
+                "Unable to read IPC message as metadata length: {err:?}"
+            ))
+        })?;

fixed in the version I'm pushing up

jdcasale · 2025-11-05T15:30:05Z

datafusion/datasource-arrow/src/source.rs

+        // correct offset which is a lot of duplicate I/O. We're opting to avoid
+        // that entirely by only acting on a single partition and reading sequentially.
+        Ok(None)
+    }


I'd argue that while this problem is worth solving, doing so is tangential to this change.
I'd like to see this solved, but I see no reason why we couldn't solve this in a follow-on.

Probably worth documenting the practical consequences of leaving it in this state though -- correct me if I'm wrong here, but I think this means that we end up hydrating the entire file into memory for certain operations, right? That's probably not a good long-term state.

github-actions bot added the core, sqllogictest, and datasource labels Nov 3, 2025

corasaurus-hex added 4 commits November 2, 2025 20:40

Infer stream ipc format for arrow data sources

3e6a570

Allow FileOpener for ArrowSource to open both IPC formats

0ad62ed

Split reading file vs stream because repartitioning + ranges

34ccba4

Fix rewind bug

99ebe62

corasaurus-hex force-pushed the cs--register-arrow-ipc-stream-format-files branch from 532ca54 to 99ebe62 Compare November 3, 2025 02:40

corasaurus-hex mentioned this pull request Nov 3, 2025

Add support for registering files in the Arrow IPC stream format as tables using register_arrow or similar #16688

Open

corasaurus-hex added 8 commits November 2, 2025 20:56

Remove a comment that isn't needed anymore

936b2e3

Stray reference left over from Rename Symbol fail

a8bc19d

Merge branch 'main' into cs--register-arrow-ipc-stream-format-files

93d26b1

Merge branch 'main' into cs--register-arrow-ipc-stream-format-files

21320cf

Address clippy error

3c00395

Address additional clippy errors

917c6c3

Merge branch 'main' into cs--register-arrow-ipc-stream-format-files

0f5642a

Merge branch 'main' into cs--register-arrow-ipc-stream-format-files

8941014

corasaurus-hex commented Nov 4, 2025


Merge branch 'main' into cs--register-arrow-ipc-stream-format-files

ffeca09

martin-g reviewed Nov 4, 2025


alamb mentioned this pull request Nov 4, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-03 #18486

Open

37 tasks

Jefffrey reviewed Nov 5, 2025


jdcasale reviewed Nov 5, 2025


corasaurus-hex added 6 commits November 6, 2025 17:59

Pull out the stream format check into an independent function

07593b4

Refactor schema inference

0446c32

Let's move the into() outside the parens

7409462

Err, no, on the inside

3f72b0c

Merge branch 'main' into cs--register-arrow-ipc-stream-format-files

9e63fc7

Also include a test for arrow stream source

c6d4a06
